103 research outputs found

    Predicting Dynamic Memory Requirements for Scientific Workflow Tasks

    Full text link
    With the increasing amount of data available to scientists in disciplines as diverse as bioinformatics, physics, and remote sensing, scientific workflow systems are becoming increasingly important for composing and executing scalable data analysis pipelines. When writing such workflows, users need to specify the resources to be reserved for tasks so that sufficient resources are allocated on the target cluster infrastructure. Crucially, underestimating a task's memory requirements can result in task failures. Therefore, users often resort to overprovisioning, resulting in significant resource wastage and decreased throughput. In this paper, we propose a novel online method that uses monitoring time series data to predict task memory usage in order to reduce the memory wastage of scientific workflow tasks. Our method predicts a task's runtime, divides it into k equally-sized segments, and learns the peak memory value for each segment depending on the total file input size. We evaluate the prototype implementation of our method using workflows from the publicly available nf-core repository, showing an average memory wastage reduction of 29.48% compared to the best state-of-the-art approac

    Towards Energy Consumption and Carbon Footprint Testing for AI-driven IoT Services

    Full text link
    Energy consumption and carbon emissions are expected to be crucial factors for Internet of Things (IoT) applications. Both the scale and the geo-distribution keep increasing, while Artificial Intelligence (AI) further penetrates the "edge" in order to satisfy the need for highly-responsive and intelligent services. To date, several edge/fog emulators are catering for IoT testing by supporting the deployment and execution of AI-driven IoT services in consolidated test environments. These tools enable the configuration of infrastructures so that they closely resemble edge devices and IoT networks. However, energy consumption and carbon emissions estimations during the testing of AI services are still missing from the current state of IoT testing suites. This study highlights important questions that developers of AI-driven IoT services are in need of answers, along with a set of observations and challenges, aiming to help researchers designing IoT testing and benchmarking suites to cater to user needs.Comment: Presented at the 2nd International Workshop on Testing Distributed Internet of Things Systems (TDIS 2022

    Selecting Efficient Cluster Resources for Data Analytics: When and How to Allocate for In-Memory Processing?

    Full text link
    Distributed dataflow systems such as Apache Spark or Apache Flink enable parallel, in-memory data processing on large clusters of commodity hardware. Consequently, the appropriate amount of memory to allocate to the cluster is a crucial consideration. In this paper, we analyze the challenge of efficient resource allocation for distributed data processing, focusing on memory. We emphasize that in-memory processing with in-memory data processing frameworks can undermine resource efficiency. Based on the findings of our trace data analysis, we compile requirements towards an automated solution for efficient cluster resource allocation.Comment: 4 pages, 3 Figures; ACM SSDBM 202

    Ruya: Memory-Aware Iterative Optimization of Cluster Configurations for Big Data Processing

    Full text link
    Selecting appropriate computational resources for data processing jobs on large clusters is difficult, even for expert users like data engineers. Inadequate choices can result in vastly increased costs, without significantly improving performance. One crucial aspect of selecting an efficient resource configuration is avoiding memory bottlenecks. By knowing the required memory of a job in advance, the search space for an optimal resource configuration can be greatly reduced. Therefore, we present Ruya, a method for memory-aware optimization of data processing cluster configurations based on iteratively exploring a narrowed-down search space. First, we perform job profiling runs with small samples of the dataset on just a single machine to model the job's memory usage patterns. Second, we prioritize cluster configurations with a suitable amount of total memory and within this reduced search space, we iteratively search for the best cluster configuration with Bayesian optimization. This search process stops once it converges on a configuration that is believed to be optimal for the given job. In our evaluation on a dataset with 1031 Spark and Hadoop jobs, we see a reduction of search iterations to find an optimal configuration by around half, compared to the baseline.Comment: 9 pages, 5 Figures, 3 Tables; IEEE BigData 2022. arXiv admin note: substantial text overlap with arXiv:2206.1385

    Detecting and Mitigating Network Packet Overloads on Real-Time Devices in IoT Systems

    Get PDF
    Manufacturing, automotive, and aerospace environments use embedded systems for control and automation and need to fulfill strict real-time guarantees. To facilitate more efficient business processes and remote control, such devices are being connected to IP networks. Due to the difficulty in predicting network packets and the interrelated workloads of interrupt handlers and drivers, devices controlling time critical processes stand under the risk of missing process deadlines when under high network loads. Additionally, devices at the edge of large networks and the internet are subject to a high risk of load spikes and network packet overloads. In this paper, we investigate strategies to detect network packet overloads in real-time and present four approaches to adaptively mitigate local deadline misses. In addition to two strategies mitigating network bursts with and without hysteresis, we present and discuss two novel mitigation algorithms, called Budget and Queue Mitigation. In an experimental evaluation, all algorithms showed mitigating effects, with the Queue Mitigation strategy enabling most packet processing while preventing lateness of critical tasks.Comment: EdgeSys '2

    How Workflow Engines Should Talk to Resource Managers: A Proposal for a Common Workflow Scheduling Interface

    Full text link
    Scientific workflow management systems (SWMSs) and resource managers together ensure that tasks are scheduled on provisioned resources so that all dependencies are obeyed, and some optimization goal, such as makespan minimization, is fulfilled. In practice, however, there is no clear separation of scheduling responsibilities between an SWMS and a resource manager because there exists no agreed-upon separation of concerns between their different components. This has two consequences. First, the lack of a standardized API to exchange scheduling information between SWMSs and resource managers hinders portability. It incurs costly adaptations when a component should be replaced by another one (e.g., an SWMS with another SWMS on the same resource manager). Second, due to overlapping functionalities, current installations often actually have two schedulers, both making partial scheduling decisions under incomplete information, leading to suboptimal workflow scheduling. In this paper, we propose a simple REST interface between SWMSs and resource managers, which allows any SWMS to pass dynamic workflow information to a resource manager, enabling maximally informed scheduling decisions. We provide an exemplary implementation of this API for Nextflow as an SWMS and Kubernetes as a resource manager. Our experiments with nine real-world workflows show that this strategy reduces makespan by up to 25.1% and 10.8% on average compared to the standard Nextflow/Kubernetes configuration. Furthermore, a more widespread implementation of this API would enable leaner code bases, a simpler exchange of components of workflow systems, and a unified place to implement new scheduling algorithms.Comment: Paper accepted in: 2023 23rd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid

    Lotaru: Locally Predicting Workflow Task Runtimes for Resource Management on Heterogeneous Infrastructures

    Full text link
    Many resource management techniques for task scheduling, energy and carbon efficiency, and cost optimization in workflows rely on a-priori task runtime knowledge. Building runtime prediction models on historical data is often not feasible in practice as workflows, their input data, and the cluster infrastructure change. Online methods, on the other hand, which estimate task runtimes on specific machines while the workflow is running, have to cope with a lack of measurements during start-up. Frequently, scientific workflows are executed on heterogeneous infrastructures consisting of machines with different CPU, I/O, and memory configurations, further complicating predicting runtimes due to different task runtimes on different machine types. This paper presents Lotaru, a method for locally predicting the runtimes of scientific workflow tasks before they are executed on heterogeneous compute clusters. Crucially, our approach does not rely on historical data and copes with a lack of training data during the start-up. To this end, we use microbenchmarks, reduce the input data to quickly profile the workflow locally, and predict a task's runtime with a Bayesian linear regression based on the gathered data points from the local workflow execution and the microbenchmarks. Due to its Bayesian approach, Lotaru provides uncertainty estimates that can be used for advanced scheduling methods on distributed cluster infrastructures. In our evaluation with five real-world scientific workflows, our method outperforms two state-of-the-art runtime prediction baselines and decreases the absolute prediction error by more than 12.5%. In a second set of experiments, the prediction performance of our method, using the predicted runtimes for state-of-the-art scheduling, carbon reduction, and cost prediction, enables results close to those achieved with perfect prior knowledge of runtimes

    Magpie: Automatically Tuning Static Parameters for Distributed File Systems using Deep Reinforcement Learning

    Full text link
    Distributed file systems are widely used nowadays, yet using their default configurations is often not optimal. At the same time, tuning configuration parameters is typically challenging and time-consuming. It demands expertise and tuning operations can also be expensive. This is especially the case for static parameters, where changes take effect only after a restart of the system or workloads. We propose a novel approach, Magpie, which utilizes deep reinforcement learning to tune static parameters by strategically exploring and exploiting configuration parameter spaces. To boost the tuning of the static parameters, our method employs both server and client metrics of distributed file systems to understand the relationship between static parameters and performance. Our empirical evaluation results show that Magpie can noticeably improve the performance of the distributed file system Lustre, where our approach on average achieves 91.8% throughput gains against default configuration after tuning towards single performance indicator optimization, while it reaches 39.7% more throughput gains against the baseline.Comment: Accepted at The IEEE International Conference on Cloud Engineering (IC2E) conference 202

    Towards a Cognitive Compute Continuum: An Architecture for Ad-Hoc Self-Managed Swarms

    Get PDF
    In this paper we introduce our vision of a Cognitive Computing Continuum to address the changing IT service provisioning towards a distributed, opportunistic, self-managed collaboration between heterogeneous devices outside the traditional data center boundaries. The focal point of this continuum are cognitive devices, which have to make decisions autonomously using their on-board computation and storage capacity based on information sensed from their environment. Such devices are moving and cannot rely on fixed infrastructure elements, but instead realise on-the-fly networking and thus frequently join and leave temporal swarms. All this creates novel demands for the underlying architecture and resource management, which must bridge the gap from edge to cloud environments, while keeping the QoS parameters within required boundaries. The paper presents an initial architecture and a resource management framework for the implementation of this type of IT service provisioning.Comment: 8 pages, CCGrid 2021 Cloud2Things Worksho
    • …
    corecore